Feature Engineering (FE) & EDA Assignment¶

Zhenyu Wang¶

UNI: zw2847¶

Part 1 (takeaways from the paper reading)¶

The article I chose is "Data-snooping, technical trading rule performance and the bootstrap"¶

Summary¶

  • This article evaluates the performance of technical trading rules using White's Reality Check bootstrap method, which quantifies data snooping bias and fully adjusts for its influence across the entire universe of rules considered. It extends the research of Brock, Lakonishok, and LeBaron (1992), applying 26 trading rules to 100 years of daily data on the Dow Jones Industrial Average and measuring the impact of data snooping. The article finds that during the BLL sample period, certain trading rules did indeed perform well, even after accounting for data snooping.
  • The core conclusion of the paper is the importance of accounting for data snooping effects when assessing financial performance. Although a specific trading rule generated nearly 10% annual excess returns during the sample period, with a p-value of 0.04 in isolation, its effective data snooping-adjusted p-value is actually 0.90, because it was selected from a broad set of rules. The contrast under the Sharpe ratio criterion is even more striking: there, the data snooping-adjusted and unadjusted p-values are 0.99 and below 0.002, respectively. Additionally, the paper evaluates out-of-sample performance based on a recursive decision rule, in which each day's trading signal is generated by the rule that had produced the highest cumulative wealth up to the previous trading day. Table VI provides summary statistics for the best rule and the cumulative wealth rule out of sample, for the DJIA (1987-1996) and S&P 500 futures (1984-1996), with rules selected by mean return. Interestingly, the cumulative wealth rule performed poorly in both out-of-sample periods; applied to S&P 500 futures, it actually produced negative returns. It is also worth noting that the best rule for the DJIA traded only six times, with an average holding period of over 400 days per trade, far longer than the 4.3-day average holding period of the best rule over the full 100-year sample.
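The adjustment described above can be sketched in a few lines: the key idea of the Reality Check is to compare the best rule's performance against the bootstrap distribution of the *maximum* performance across all rules, rather than testing each rule in isolation. Below is a minimal illustration on synthetic returns; the rule count, drift, and plain i.i.d. resampling are simplifications of my own (the paper uses the stationary bootstrap to respect serial dependence).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily excess returns for 26 candidate trading rules over 1000 days;
# rule 0 is given a small positive drift, the rest are pure noise.
n_days, n_rules = 1000, 26
returns = rng.normal(0.0, 0.01, size=(n_days, n_rules))
returns[:, 0] += 0.0005

def reality_check_pvalue(returns, n_boot=2000, seed=1):
    """Data-snooping-adjusted p-value for the best rule's mean return.
    Uses a plain i.i.d. bootstrap for brevity; White's Reality Check
    proper uses the stationary bootstrap."""
    rng = np.random.default_rng(seed)
    n, _ = returns.shape
    means = returns.mean(axis=0)
    best = means.max()                    # observed performance of the best rule
    centered = returns - means            # impose the null of zero mean per rule
    exceed = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample days with replacement
        exceed += centered[idx].mean(axis=0).max() >= best
    return exceed / n_boot

print(f"adjusted p-value for the best of 26 rules: {reality_check_pvalue(returns):.3f}")
```

Because the null distribution is that of the maximum over all 26 rules, a rule that looks significant on its own can easily receive a large adjusted p-value, which is exactly the 0.04-versus-0.90 pattern reported in the paper.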

Part 2 (time-series data analysis)¶

Loading necessary packages¶

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Loading time-series data (I am using Amazon Stock Price ALL-TIME)¶

In [2]:
dat = pd.read_csv('Amazon.csv')
dat = dat[['Date','Close']]
dat.columns = ['ds','y']
dat.tail()
Out[2]:
ds y
6150 2021-10-21 3435.010010
6151 2021-10-22 3335.550049
6152 2021-10-25 3320.370117
6153 2021-10-26 3376.070068
6154 2021-10-27 3396.189941
In [3]:
plt.figure(figsize=(12, 6))
plt.plot(dat.index, dat['y'], label='Amazon Stock Price')
plt.xlabel('Time')
plt.ylabel('Price')
plt.title('Time Series Plot')
plt.legend()
plt.show()

1. Simple Moving Average (SMA Method)¶

In [4]:
dat['SMA'] = dat['y'].rolling(window=100).mean()
dat['diff'] = dat['y'] - dat['SMA']
dat[['y','SMA']].plot()

print(f"There are {len(dat)} rows in this dataset.\n")
print("Since the data contains 6155 rows, I am setting the \"window\" parameter to 100, \nwhich means using the average of the past 100 data points to plot the SMA line.")
There are 6155 rows in this dataset.

Since the data contains 6155 rows, I am setting the "window" parameter to 100, 
which means using the average of the past 100 data points to plot the SMA line.
In [5]:
dat['diff'].hist()
plt.title('The distribution of diff')

print("After calculating the differences between the actual prices and the SMA, \nthe histogram shows that most points deviate from the SMA by less than about 200.")
After calculating the differences between the actual prices and the SMA, 
the histogram shows that most points deviate from the SMA by less than about 200.
In [6]:
dat['upper'] = dat['SMA'] + 200
dat['lower'] = dat['SMA'] - 200
dat[100:200]
Out[6]:
ds y SMA diff upper lower
100 1997-10-07 4.057292 2.367604 1.689688 202.367604 -197.632396
101 1997-10-08 4.005208 2.390365 1.614843 202.390365 -197.609635
102 1997-10-09 3.750000 2.410781 1.339219 202.410781 -197.589219
103 1997-10-10 3.901042 2.433438 1.467604 202.433438 -197.566562
104 1997-10-13 4.000000 2.459167 1.540833 202.459167 -197.540833
... ... ... ... ... ... ...
195 1998-02-24 5.406250 4.604948 0.801302 204.604948 -195.395052
196 1998-02-25 5.489583 4.619635 0.869948 204.619635 -195.380365
197 1998-02-26 6.062500 4.640156 1.422344 204.640156 -195.359844
198 1998-02-27 6.416667 4.664167 1.752500 204.664167 -195.335833
199 1998-03-02 6.354167 4.686458 1.667709 204.686458 -195.313542

100 rows × 6 columns

In [7]:
def plot_it():
    plt.plot(dat['y'],'go',markersize=2,label='Actual')
    plt.fill_between(
       np.arange(dat.shape[0]), dat['lower'], dat['upper'], alpha=0.5, color="r",
       label="Predicted interval")
    plt.xlabel("Ordered samples.")
    plt.ylabel("Values and prediction intervals.")
    plt.show()
    
plot_it()

print("Above is the tolerance band, which reveals the outliers.\n")
print("After drawing the tolerance band, we can clearly see the trend within the dataset.\nTo conclude, applying the SMA method to Amazon's stock price reveals a clear upward pattern.")
Above is the tolerance band, which reveals the outliers.

After drawing the tolerance band, we can clearly see the trend within the dataset.
To conclude, applying the SMA method to Amazon's stock price reveals a clear upward pattern.
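The ±200 half-width above is eyeballed from the histogram; a data-driven alternative is to set the band from a quantile of the absolute differences, so the flagged fraction is controlled directly. A small sketch on a synthetic random-walk series (the series and the 99% quantile choice are illustrative, not part of the assignment data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic random-walk price series standing in for the 'y' column above
y = pd.Series(np.cumsum(rng.normal(0, 1, 500)) + 100.0)

sma = y.rolling(window=100).mean()
diff = y - sma

# Data-driven half-width: the 99th percentile of |diff| instead of a fixed 200
half_width = diff.abs().quantile(0.99)
outliers = y[diff.abs() > half_width]
print(f"band half-width: {half_width:.2f}, outliers flagged: {len(outliers)}")
```

By construction roughly 1% of the valid points land outside the band, which makes the "outlier" label explicit rather than visual.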

2. Exponential Smoothing Method¶

In [8]:
from statsmodels.tsa.api import SimpleExpSmoothing
import pandas as pd
import numpy as np
In [9]:
dat = pd.read_csv('Amazon.csv')
dat = dat[['Date','Close']]
dat.columns = ['ds','y']
dat.tail()
Out[9]:
ds y
6150 2021-10-21 3435.010010
6151 2021-10-22 3335.550049
6152 2021-10-25 3320.370117
6153 2021-10-26 3376.070068
6154 2021-10-27 3396.189941
In [10]:
EMAfit = SimpleExpSmoothing(dat['y']).fit(smoothing_level=0.2, optimized=False)
EMA = EMAfit.forecast(3).rename(r'$\alpha=0.2$')
dat['EMA'] = EMAfit.predict(start=0)
dat['diff'] = dat['y'] - dat['EMA']

plt.figure(figsize=(10, 6))
plt.plot(dat['y'], label='Original Data', marker='o', linestyle='-', color='b', alpha=0.7)
plt.plot(dat['EMA'], label='EMA', marker='o', linestyle='-', color='r', alpha=0.7)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.title('Original Data vs. EMA Smoothed Data')
plt.grid(True)
plt.tight_layout()
plt.show()

print("Setting the smoothing level to 0.2 (the same value used in the notes) means the EMA gives\ngeometrically decaying weight to past observations.\nThe most recent data point receives a weight of 20% in the EMA calculation.\nThe second most recent data point receives a weight of 20% * (1 - 0.2) = 16%.\nThe third most recent data point receives a weight of 16% * (1 - 0.2) = 12.8%.\nAnd so on.\n")
print("A higher smoothing factor like 0.5 results in a more responsive EMA that reacts quickly to changes in the data,\nwhile a lower smoothing factor like 0.2 makes the EMA smoother and less responsive to short-term fluctuations.")
Setting the smoothing level to 0.2 (the same value used in the notes) means the EMA gives
geometrically decaying weight to past observations.
The most recent data point receives a weight of 20% in the EMA calculation.
The second most recent data point receives a weight of 20% * (1 - 0.2) = 16%.
The third most recent data point receives a weight of 16% * (1 - 0.2) = 12.8%.
And so on.

A higher smoothing factor like 0.5 results in a more responsive EMA that reacts quickly to changes in the data,
while a lower smoothing factor like 0.2 makes the EMA smoother and less responsive to short-term fluctuations.
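The weight arithmetic above is easy to verify directly: with smoothing level α, the weight on the observation k steps back is α(1 − α)^k, and the full weight sequence is a geometric series summing to 1.

```python
alpha = 0.2

# Weight on the observation k steps back in simple exponential smoothing
weights = [alpha * (1 - alpha) ** k for k in range(5)]
print([round(w, 4) for w in weights])  # [0.2, 0.16, 0.128, 0.1024, 0.0819]

# The infinite weight sequence sums to 1; 200 terms is already effectively there
total = sum(alpha * (1 - alpha) ** k for k in range(200))
print(round(total, 6))  # 1.0
```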
In [11]:
dat['diff'].hist()
plt.title('The distribution of diff')

print("The EMA's predictions and the histogram of its differences look different from the SMA's:\nthe differences are much more tightly concentrated, so this time I used 100 as the half-width of the tolerance band.")
The EMA's predictions and the histogram of its differences look different from the SMA's:
the differences are much more tightly concentrated, so this time I used 100 as the half-width of the tolerance band.
In [12]:
dat['upper'] = dat['EMA'] + 100
dat['lower'] = dat['EMA'] - 100
plot_it()

3. Seasonal-Trend Decomposition (STD Method)¶

In [13]:
import pandas as pd
import statsmodels.api as sm

Trend Component¶

In [14]:
dat = pd.read_csv('Amazon.csv')
dat = dat[['Date', 'Close']]
dat.columns = ['ds', 'y']
dat = dat.reset_index(drop=True)

# Convert 'ds' column to datetime
dat['ds'] = pd.to_datetime(dat['ds'], format='%Y-%m-%d')

# Set the datetime index with frequency='D' (daily)
dat = dat.set_index('ds').asfreq('D')

# Fill missing values with forward fill
dat['y'] = dat['y'].ffill()

# Perform seasonal decomposition
result = sm.tsa.seasonal_decompose(dat['y'], model='additive')

# Plot the trend component for the first 200 data points
result.trend.iloc[1:200].plot(figsize=(12, 6), title='Trend Component (First 200 Data Points)')
plt.xlabel('Date')
plt.ylabel('Trend')
plt.grid(True)
plt.tight_layout()
plt.show()

print("The trend component is clearly captured here: the stock price increments grow larger over time,\nand the overall trend is unmistakably positive.")
The trend component is clearly captured here: the stock price increments grow larger over time,
and the overall trend is unmistakably positive.

Seasonal Component¶

In [15]:
result.seasonal.iloc[1:100].plot(figsize=(12, 6), title='Seasonal Component (First 100 Data Points)')
plt.xlabel('Date')
plt.ylabel('Seasonal Effect')
plt.grid(True)
plt.tight_layout()
plt.show()

print("There appears to be no meaningful seasonal effect, or at most a very short-period seasonal pattern.")
There appears to be no meaningful seasonal effect, or at most a very short-period seasonal pattern.

Residuals Component¶

In [16]:
result.resid.iloc[1:].plot(figsize=(12, 6), title='Residuals Component (All Data Point)')
plt.xlabel('Date')
plt.ylabel('Residuals')
plt.grid(True)
plt.tight_layout()
plt.show()

print("We can see that the residuals grow in magnitude as time goes by, which means they are not stationary.\nIn other words, the additive decomposition leaves increasingly large unexplained deviations as the price level rises.")
We can see that the residuals grow in magnitude as time goes by, which means they are not stationary.
In other words, the additive decomposition leaves increasingly large unexplained deviations as the price level rises.
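A quick way to back up the non-stationarity claim is a rolling standard deviation of the residuals: if it climbs steadily, the variance is not constant. A sketch on synthetic heteroscedastic residuals (the series here is made up for illustration; with the real data, `result.resid.dropna()` would take its place):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
# Synthetic residuals whose spread grows over time, mimicking the plot above
resid = pd.Series(rng.normal(0, 1, n) * np.linspace(0.1, 5.0, n))

# Rolling standard deviation: a steady climb indicates non-constant variance,
# i.e. the residual series is not stationary
roll_std = resid.rolling(window=100).std()
print(f"early std: {roll_std.iloc[150]:.2f}, late std: {roll_std.iloc[-1]:.2f}")
```

When the variance grows with the price level like this, the usual fix is to decompose the log of the price, or to pass `model='multiplicative'` to `seasonal_decompose`.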

4. The Prophet Module¶

In [17]:
from prophet import Prophet
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.offline as py
py.init_notebook_mode()
%matplotlib inline
In [18]:
dat = pd.read_csv('Amazon.csv')
dat = dat[['Date','Close']]
dat.columns = ['ds','y']

# Fitting with default parameters
dat_model_0 = Prophet(daily_seasonality=True)
dat_model_0.fit(dat)

print("I initialized a Facebook Prophet model with daily seasonality enabled, acknowledging the daily patterns in stock price fluctuations.\n")
print("This Prophet model can be used to learn and capture the underlying patterns and trends in Amazon's stock prices for future forecasting and analysis.")
00:52:20 - cmdstanpy - INFO - Chain [1] start processing
00:52:22 - cmdstanpy - INFO - Chain [1] done processing
I initialized a Facebook Prophet model with daily seasonality enabled, acknowledging the daily patterns in stock price fluctuations.

This Prophet model can be used to learn and capture the underlying patterns and trends in Amazon's stock prices for future forecasting and analysis.
In [19]:
future= dat_model_0.make_future_dataframe(periods=20, freq='d')
future.tail()

print("This creates 20 future timestamps at daily intervals,\nwhich serve as the basis for time series forecasting with the Facebook Prophet model.")
This creates 20 future timestamps at daily intervals,
which serve as the basis for time series forecasting with the Facebook Prophet model.
In [20]:
dat_model_0_data=dat_model_0.predict(future)
dat_model_0_data.tail()
Out[20]:
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper daily ... weekly weekly_lower weekly_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
6170 2021-11-12 3423.655898 3238.897681 3561.345435 3423.655898 3423.655898 -20.297988 -20.297988 -20.297988 -8.431807 ... -1.858536 -1.858536 -1.858536 -10.007645 -10.007645 -10.007645 0.0 0.0 0.0 3403.357910
6171 2021-11-13 3425.180998 3247.565206 3572.819535 3425.180998 3425.180998 -17.372189 -17.372189 -17.372189 -8.431807 ... 1.053974 1.053974 1.053974 -9.994356 -9.994356 -9.994356 0.0 0.0 0.0 3407.808809
6172 2021-11-14 3426.706097 3265.640672 3573.201261 3426.706097 3426.706097 -17.304978 -17.304978 -17.304978 -8.431807 ... 1.053974 1.053974 1.053974 -9.927144 -9.927144 -9.927144 0.0 0.0 0.0 3409.401120
6173 2021-11-15 3428.231197 3234.355894 3579.547531 3428.231197 3428.231197 -19.924146 -19.924146 -19.924146 -8.431807 ... -1.679599 -1.679599 -1.679599 -9.812740 -9.812740 -9.812740 0.0 0.0 0.0 3408.307051
6174 2021-11-16 3429.756297 3259.808963 3580.840221 3429.756297 3429.756297 -17.863908 -17.863908 -17.863908 -8.431807 ... 0.226597 0.226597 0.226597 -9.658698 -9.658698 -9.658698 0.0 0.0 0.0 3411.892389

5 rows × 22 columns

In [21]:
from prophet.plot import add_changepoints_to_plot

# Refit the model and forecast 20 days ahead
dat_model_0 = Prophet()
dat_model_0.fit(dat)

# Create a future DataFrame for forecasting
future = dat_model_0.make_future_dataframe(periods=20, freq='D')

# Make predictions on the future DataFrame
dat_model_0_data = dat_model_0.predict(future)

# Plot the forecast
fig = dat_model_0.plot(dat_model_0_data)
a = add_changepoints_to_plot(fig.gca(), dat_model_0, dat_model_0_data)

print("The main line in the graph represents the forecasted values of the time series data. This line provides predictions for future values based on historical patterns and the model's learned trends.\n")
print("The shaded areas represent uncertainty intervals, indicating the range within which the actual future values are likely to fall. The wider the uncertainty interval, the higher the uncertainty in the predictions.\n")
print("The points where the vertical dashed lines intersect the time series are potential changepoints: significant shifts in Amazon's stock price.")
00:52:25 - cmdstanpy - INFO - Chain [1] start processing
00:52:27 - cmdstanpy - INFO - Chain [1] done processing
The main line in the graph represents the forecasted values of the time series data. This line provides predictions for future values based on historical patterns and the model's learned trends.

The shaded areas represent uncertainty intervals, indicating the range within which the actual future values are likely to fall. The wider the uncertainty interval, the higher the uncertainty in the predictions.

The points where the vertical dashed lines intersect the time series are potential changepoints: significant shifts in Amazon's stock price.
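The interval logic described above can be turned into an explicit anomaly flag by comparing the observed prices with the `yhat_lower`/`yhat_upper` columns that Prophet's `predict()` returns. A minimal sketch with made-up numbers standing in for a real forecast frame:

```python
import pandas as pd

# Stand-in for two columns of Prophet's forecast output; values are illustrative
forecast = pd.DataFrame({
    "yhat_lower": [95.0, 98.0, 101.0],
    "yhat_upper": [105.0, 108.0, 111.0],
})
observed = pd.Series([100.0, 120.0, 103.0])  # actual prices on the same dates

# Flag any observation that falls outside its uncertainty interval
anomaly = (observed < forecast["yhat_lower"]) | (observed > forecast["yhat_upper"])
print(anomaly.tolist())  # [False, True, False]
```

With the notebook's real objects, `observed` would be `dat['y']` aligned on `ds` with `dat_model_0_data` restricted to the historical dates.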

Examine the “trend”, the “weekly” pattern, and the “yearly” pattern¶

In [22]:
dat_model_0.plot_components(dat_model_0_data)
Out[22]:

Conclusion¶

  • Simple Moving Average (SMA): SMA is effective at identifying anomalies in Amazon's all-time stock price data because it smoothes out short-term fluctuations and highlights the underlying trend. Anomalies are often characterized by sudden, short-term deviations from the long-term upward trend. By comparing each data point to the moving average, SMA can readily flag periods when the stock price significantly deviates from the smoothed trend, making it a robust tool for identifying short-term anomalies.
  • Exponential Smoothing: Exponential smoothing generates a smoothed forecast by giving more weight to recent data points. Anomalies are detected by comparing actual stock prices to forecasted values: by calculating the residuals, the differences between actual and forecasted values, it identifies anomalies as large positive or negative residuals, indicating unexpected deviations from the expected stock price trajectory. This model's sensitivity to long-term trends and gradual shifts makes it effective at spotting anomalies associated with sustained changes.
  • Seasonal-Trend Decomposition (STL): STL decomposes the time series data into seasonal, trend, and residual components. It is effective at identifying anomalies because it explicitly separates seasonality and trend from the remainder (residuals). Unusual patterns or abrupt changes that don't conform to the expected seasonality or trend can be considered anomalies. STL helps differentiate between regular market behavior and irregular events impacting stock prices.
  • Prophet Module: Prophet is designed to handle time series data with complex components like trends, seasonality, and holidays. Since Amazon's stock price exhibits an upward trend over the long term, Prophet is particularly effective at identifying anomalies that deviate from this. It identifies anomalies by comparing observed stock prices to forecasted values and their associated prediction intervals. Anomalies are detected when observed values fall outside these intervals. This approach is robust for capturing complex patterns and anomalies within the context of the upward trend.
To sum up, each of these models has its strengths and can identify anomalies in Amazon's all-time stock price data from different perspectives. Combining the results from multiple models can provide a more comprehensive view of anomalies, as each model may excel at capturing certain types of irregularities.¶

Reference¶

  • Data Source: https://www.kaggle.com/datasets/kannan1314/amazon-stock-price-all-time